feat: improve maintainers detection [CM-1033] #3908
Pull request overview
This PR improves maintainer file detection in the git integration service by adding a multi-step discovery and analysis flow that combines static filename matching, dynamic ripgrep-based content search, and an AI fallback, while also surfacing more metadata about what was tried.
Changes:
- Added ripgrep-based repo scanning (`rg --files` and keyword search) with fallback to `os.walk`, plus scoring/filtering of dynamic candidates.
- Refactored maintainer extraction to prioritize a previously saved maintainer file, then analyze top candidates, then use AI file suggestion as a last resort.
- Extended `MaintainerResult` and service execution metrics to include `candidate_files` and `ai_suggested_file`; added `ripgrep` to the Docker image.
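As an illustration of the first change, here is a minimal sketch of listing repository files with `rg --files` and falling back to `os.walk` when ripgrep is missing or fails. The function name `list_repo_files`, the 60-second timeout, and the `.git` pruning are assumptions made for this example and are not taken from the PR. Note that ripgrep skips hidden and ignored files by default, so the two paths are not guaranteed to return identical lists.

```python
import os
import shutil
import subprocess


def list_repo_files(repo_path: str) -> list[str]:
    """Return repo-relative file paths, preferring ripgrep over os.walk."""
    rg = shutil.which("rg")
    if rg is not None:
        try:
            proc = subprocess.run(
                [rg, "--files", "."],
                cwd=repo_path,
                stdin=subprocess.DEVNULL,
                capture_output=True,
                text=True,
                timeout=60,  # assumed limit, not from the PR
                check=True,
            )
            # normpath strips a leading "./" if ripgrep emits one
            return sorted(
                os.path.normpath(line) for line in proc.stdout.splitlines() if line
            )
        except (subprocess.SubprocessError, OSError):
            pass  # fall through to the pure-Python walk

    files: list[str] = []
    for root, dirs, names in os.walk(repo_path):
        # Prune the .git directory in place so os.walk does not descend into it
        dirs[:] = [d for d in dirs if d != ".git"]
        for name in names:
            files.append(os.path.relpath(os.path.join(root, name), repo_path))
    return sorted(files)
```

Resolving the binary with `shutil.which` first keeps the fallback cheap when the image does not ship ripgrep, which is presumably why the PR also adds it to the Dockerfile.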
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| services/apps/git_integration/src/crowdgit/services/maintainer/maintainer_service.py | New candidate discovery + fallback extraction flow; logs and metrics now include candidate/AI-suggested file metadata. |
| services/apps/git_integration/src/crowdgit/models/maintainer_info.py | Adds new result metadata fields (candidate_files, ai_suggested_file). |
| scripts/services/docker/Dockerfile.git_integration | Installs ripgrep in the runner image to support dynamic search. |
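For readers unfamiliar with the model change, the two new metadata fields could look roughly like this. This is a sketch only: `MaintainerResultSketch` and `metrics_payload` are hypothetical names, the real `MaintainerResult` has other fields and may not be a dataclass, and the top-100 cap is the one mentioned in the PR description.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MaintainerResultSketch:
    # Hypothetical stand-in for MaintainerResult; only the two new fields are shown.
    candidate_files: list[tuple[str, int]] = field(default_factory=list)
    ai_suggested_file: str | None = None


def metrics_payload(result: MaintainerResultSketch, limit: int = 100) -> dict:
    """Build the metrics entry, keeping only the highest-scoring candidates."""
    top = sorted(result.candidate_files, key=lambda c: c[1], reverse=True)[:limit]
    return {"candidate_files": top, "ai_suggested_file": result.ai_suggested_file}
```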
bc8e3df to b4dd488
joanagmaia left a comment
This looks great. I have a couple of questions and requests to make sure we have some more metrics, given that these are big changes to the current process.
Questions:
- With the mechanism of picking only one file for analysis, we are assuming that all maintainer information will be in a single file, right? I'm not sure whether we should make sure we won't lose data because of it.
Requests:
- Can we run the new mechanism on about 10 repos and check the accuracy? I would also run it against the issues currently open on Insights to see whether coverage has improved: https://github.com/linuxfoundation/insights/issues?q=is%3Aissue%20state%3Aopen%20maintainer
- Can we prepare a monitor in Metaplane that covers the number of repositories for which we can get maintainer data, and also the number of projects?
- Can we test using the Haiku model for `find_maintainer_file_with_ai`, since it would be a simpler task than the rest of the work?
MAX_AI_FILE_LIST_SIZE = 300

# Full paths that get the highest score bonus when matched exactly
KNOWN_PATHS = {
We should also include SECURITY-INSIGHTS.md. It was supported before as well.
E.g. https://github.com/open-telemetry/opentelemetry-dotnet/blob/d54379e28c07db783452a33e119f1cdf8e7d96a6/SECURITY-INSIGHTS.yml#L13
}

# Governance stems (basename without extension, lowercased) for filename search
GOVERNANCE_STEMS = {
Should we also add:
- workgroup (e.g. https://github.com/open-feature/community/blob/d2f54702a4bca67cd7781a8fed91e9809ecc4a0a/config/open-feature/sdk-ruby/workgroup.yaml#L15)
My only concern here is that they seem to use the community repo to manage some maintainer data, so we might need to infer the repository from the directory structure. Maybe that's too complex for us to support, at least for now.
It's tricky when the repo and maintainers are in different places; I'll check how we can support this easily.
5bcd908 to 0b8a57e
0b8a57e to 1546f7e
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
… detection Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
…rd in content Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
…improve prompt Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
…irectories Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
645633a to bbacdfa

What changed
Before
- … (`MAINTAINER_FILES`: 13 entries, root-only, no recursion).
- `README.md` was in the candidate list and required a simple content check for the word "maintainer".
- `extract_maintainers` always started from scratch, with no reuse of a previously found file.
- `compare_and_update_maintainers` skipped all maintainers with `github_username == "unknown"`, including those with a valid email; there was no email fallback for identity lookup.
- `candidate_files` and `ai_suggested_file` did not exist in `MaintainerResult` or execution metrics.
After
Detection pipeline (4-step with fallback)
- `rg` scans the full repo for files matching 20 governance stems (`MAINTAINERS`, `OWNERS`, `CODEOWNERS`, `GOVERNANCE`, `EMERITUS`, etc.) across all depths and valid extensions. Each file is scored: exact known path (100), exact stem match (50), partial stem (25), plus +1 per governance keyword found in content. All candidates are returned sorted by score; the top one is analyzed.
- … `maintainer`.
- … `(filename, score)` tuples. The prompt instructs the model to prefer higher scores, shallower paths, and to reject files inside `vendor/`, `node_modules/`, `third_party/`, `external/`, and similar third-party directories.
Bug fixes
- `compare_and_update_maintainers`: the skip guard now only fires when both `github_username` and `email` are unknown/None (previously it skipped all `"unknown"` usernames unconditionally). New maintainers identified by email now go through `find_maintainer_identity_by_email` as a fallback, matching `insert_new_maintainers` behaviour.
- … `else` branch, avoiding a wasted string allocation on every large file.
Observability
- `MaintainerResult` gains `candidate_files: list[tuple[str, int]]` and `ai_suggested_file: str | None`.
- `ServiceExecution` metrics now record `candidate_files` (top-100 by score) and `ai_suggested_file` on every run.
Note
Medium Risk
Moderate risk because it refactors the maintainer discovery pipeline (new ripgrep-based scanning, scoring, and AI fallback) and changes how identities are resolved for "unknown" usernames, which can affect maintainer ingestion and run-time behavior.
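The identity change this risk note refers to can be sketched as follows. This is an illustrative reading of the described behaviour, not the PR's code: `is_unknown`, `should_skip`, and `lookup_key` are hypothetical helpers, and treating the string `"unknown"` case-insensitively is an assumption.

```python
from __future__ import annotations


def is_unknown(value: str | None) -> bool:
    """True when a field carries no usable identity (None or the 'unknown' marker)."""
    return value is None or value.strip().lower() == "unknown"


def should_skip(github_username: str | None, email: str | None) -> bool:
    """New guard: skip only when BOTH username and email are unknown."""
    return is_unknown(github_username) and is_unknown(email)


def lookup_key(github_username: str | None, email: str | None) -> tuple[str, str] | None:
    """Prefer the GitHub username; fall back to email; None means skip."""
    if not is_unknown(github_username):
        return ("github_username", github_username)
    if not is_unknown(email):
        return ("email", email)
    return None
```

Under the old guard, a maintainer with an `"unknown"` username and a valid email would have been dropped; in this sketch that case resolves to an email lookup instead.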
Overview
Improves maintainer file discovery by replacing the hard-coded filename scan with a multi-step detection pipeline: reuse the previously saved file, recursively search repo files via `ripgrep` with filename/content scoring and depth preference, and fall back to AI selection using scored candidates (with explicit third-party directory exclusions).
Extends observability by recording `candidate_files` and `ai_suggested_file` on `MaintainerResult` and persisting them in `ServiceExecution` metrics, and adjusts maintainer upsert logic to only skip truly unknown identities and to fall back to email-based identity lookup when `github_username` is `"unknown"`. Also adds `ripgrep` to the git-integration Docker runner image to support the new detection.
Written by Cursor Bugbot for commit 56454b9.
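The filename/content scoring with depth preference described in this PR could be sketched like this. Only the weights (100, 50, 25, +1 per keyword) and the five stems named in the description come from the PR; the contents of `KNOWN_PATHS` and `KEYWORDS`, the choice to treat the three filename tiers as mutually exclusive, counting each keyword once, and using depth as a tie-breaker are all assumptions made for the example.

```python
import os

# Hypothetical subset; the PR's actual KNOWN_PATHS entries are not shown in this thread.
KNOWN_PATHS = {"maintainers.md", ".github/codeowners"}
# Five of the 20 governance stems named in the PR description.
GOVERNANCE_STEMS = {"maintainers", "owners", "codeowners", "governance", "emeritus"}
# Hypothetical content keywords.
KEYWORDS = ("maintainer", "owner", "reviewer")


def score_file(path: str, content: str = "") -> int:
    """Score one file: filename tier (100/50/25/0) plus 1 per distinct keyword in content."""
    norm = path.replace("\\", "/").lower()
    stem = os.path.splitext(os.path.basename(norm))[0]
    if norm in KNOWN_PATHS:
        score = 100  # exact known path
    elif stem in GOVERNANCE_STEMS:
        score = 50  # exact stem match
    elif any(s in stem for s in GOVERNANCE_STEMS):
        score = 25  # partial stem match
    else:
        score = 0
    lowered = content.lower()
    return score + sum(1 for kw in KEYWORDS if kw in lowered)


def rank_candidates(files: dict[str, str]) -> list[tuple[str, int]]:
    """Return (filename, score) tuples, best first; shallower paths win ties."""
    scored = [(path, score_file(path, content)) for path, content in files.items()]
    scored = [(path, s) for path, s in scored if s > 0]
    return sorted(scored, key=lambda ps: (-ps[1], ps[0].count("/"), ps[0]))
```

The output shape matches the `(filename, score)` tuples that the description says are passed to the AI fallback, so the same list could serve both the top-candidate analysis and the prompt.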